Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream by WenqingLan1 · Pull Request #769 · microsoft/superbenchmark

WenqingLan1 · 2025-12-19T20:05:13Z

Refinements:

Add support for float (fp32) execution and --data_type <float|double> CLI option for runtime type selection.
Refactor CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move template kernel implementations into a header for cross-TU instantiation. (Required for CUDA template instantiation across compilation units.)
Fix a buffer allocation size bug: args->size is the buffer size in bytes, not the number of elements.
Adjust execution/output to single visible GPU (device 0 via CUDA_VISIBLE_DEVICES) and update metric/tag formats (removing gpu_id) plus docs/examples/test log.
Switch NUMA allocation from the hard-coded numa_alloc_onnode to numa_alloc_local so that host memory is allocated on the local NUMA node.
Rename entry point file from gpu_stream_test.cpp to gpu_stream_main.cpp.
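
The relationship between the byte-sized `size` parameter, the element type, and 128-bit accesses can be sketched as below. This is only an illustration, not the PR's launch code: `launch_shape`, `ELEMENT_BYTES`, and `VECTOR_BYTES` are hypothetical names, and the grid computation assumes one 16-byte vector (float4 or double2) per thread with a block size of 256.

```python
# Bytes per element for the two --data_type values named in this PR.
ELEMENT_BYTES = {"float": 4, "double": 8}
VECTOR_BYTES = 16  # one 128-bit access: float4 or double2


def launch_shape(size_bytes, data_type, block_size=256):
    """Return (elements, vectors, blocks) for a buffer of size_bytes bytes.

    Hypothetical helper: the real launch logic lives in gpu_stream.cu
    and may differ from this sketch.
    """
    if size_bytes % VECTOR_BYTES != 0:
        raise ValueError("size must be a multiple of 16 bytes")
    # size is a byte count, so divide by the element width to get elements.
    elements = size_bytes // ELEMENT_BYTES[data_type]
    # Assumption: each thread handles exactly one 16-byte vector.
    vectors = size_bytes // VECTOR_BYTES
    # Ceiling division so a partial block is still launched.
    blocks = (vectors + block_size - 1) // block_size
    return elements, vectors, blocks


print(launch_shape(4294967296, "double"))
print(launch_shape(2147483648, "float"))
```

With the sizes used in the new config, both data types end up with the same element count, while the float run moves half as many bytes.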

Note: the metric tag no longer includes gpu_idx and execution is now per process, so users need to update their configs and rules.
New config:

    gpu-stream:fp64:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 4294967296
        data_type: double
    gpu-stream:fp64-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: double
        check_data: true
    gpu-stream:fp32:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 2147483648
        data_type: float
    gpu-stream:fp32-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: float
        check_data: true

New rule:

    gpu-stream:
      statistics:
        - mean
      categories: GPU-STREAM
      aggregate: True
      metrics:
        - gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)

Example results:

"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234

Processed by rules:

| gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw | mean | 1234|
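
As a rough check of how the new rule pattern lines up with the example results, the sketch below applies the regex from the rule to the four per-rank metrics and averages them. It is not SuperBench's rule engine: `aggregate_mean` is a hypothetical helper, and stripping the `:<rank>` suffix is an assumption about what `aggregate: True` amounts to for this pattern.

```python
import re
from collections import defaultdict
from statistics import mean

# Pattern copied from the rule above; the capture group is the rank suffix.
RULE = re.compile(r"gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)")


def aggregate_mean(results):
    """Group matching metrics by name without the rank suffix, then average."""
    grouped = defaultdict(list)
    for name, value in results.items():
        match = RULE.fullmatch(name)
        if match is None:
            continue  # metric is not covered by this rule
        base = name[: match.start(1) - 1]  # drop the trailing ":<rank>"
        grouped[base].append(value)
    return {base: mean(values) for base, values in grouped.items()}


metric = "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw"
results = {f"{metric}:{rank}": 1234 for rank in range(4)}
print(aggregate_mean(results))
```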

codecov · 2025-12-19T20:14:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.71%. Comparing base (6a1d02a) to head (a7f83d4).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #769   +/-   ##
=======================================
  Coverage   85.70%   85.71%           
=======================================
  Files         103      103           
  Lines        7907     7908    +1     
=======================================
+ Hits         6777     6778    +1     
  Misses       1130     1130

| Flag | Coverage Δ |
|---|---|
| cpu-python3.10-unit-test | `70.48% <50.00%> (+<0.01%)` ⬆️ |
| cpu-python3.12-unit-test | `70.48% <50.00%> (+<0.01%)` ⬆️ |
| cpu-python3.7-unit-test | `69.90% <50.00%> (+<0.01%)` ⬆️ |
| cuda-unit-test | `83.62% <100.00%> (+<0.01%)` ⬆️ |

Copilot

Pull request overview

Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.

Changes:

Add --data_type <float|double> to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper + unit tests.
Refactor CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move template kernel implementations into a header for cross-TU instantiation.
Adjust execution/output to single visible GPU (device 0 via CUDA_VISIBLE_DEVICES) and update metric/tag formats (removing gpu_id) plus docs/examples/test log.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.

File	Description
`tests/data/gpu_stream.log`	Updates golden log output to include data type and new tag format (no `gpu_id`).
`tests/benchmarks/micro_benchmarks/test_gpu_stream.py`	Extends command-generation assertions to include `--data_type` (currently only covers `double`).
`superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp`	Removes NUMA/GPU iteration fields from args and adds `Opts::data_type`.
`superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp`	Adds CLI parsing/printing for `--data_type`.
`superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp`	New entry point replacing the previous main file.
`superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp`	Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores).
`superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu`	Keeps a CUDA compilation unit and moves template implementations to the header.
`superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp`	Expands bench-args variant to support `float` and `double`.
`superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu`	Uses local NUMA allocation, enforces 16B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0.
`superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt`	Switches target sources to the new `gpu_stream_main.cpp`.
`superbench/benchmarks/micro_benchmarks/gpu_stream.py`	Adds `--data_type` argument and forwards it to the binary.
`examples/benchmarks/gpu_stream.py`	Updates example invocation to include `--data_type double`.
`docs/user-tutorial/benchmarks/micro-benchmarks.md`	Updates gpu-stream metric patterns to include `(double

Copilot

Pull request overview

Copilot reviewed 12 out of 14 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:90

ParseOpts initializes size_specified to true, which means omitting --size will not trigger the missing-required-arg check, unlike the pattern used in other micro-benchmarks (e.g., cpu_copy/gpu_copy). If --size is intended to be required per PrintUsage, size_specified should start as false (and only be set true when parsed).

                                     {"check_data", no_argument, nullptr, static_cast<int>(OptIdx::kEnableCheckData)},
                                     {"data_type", required_argument, nullptr, static_cast<int>(OptIdx::kDataType)}};
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;
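
The effect of the initial flag value can be modeled with a small Python sketch that uses the standard `getopt` module in place of `getopt_long`. It is a model of the logic described in the comment, not the C++ code in gpu_stream_utils.cpp: `parse_opts` and `LONG_OPTS` are hypothetical names, the `--num_warm_up` and `--num_loops` flag names are inferred from the config parameters, and the `double` default for `--data_type` is an assumption.

```python
import getopt

# A trailing "=" marks an option that requires an argument.
LONG_OPTS = ["size=", "num_warm_up=", "num_loops=", "check_data", "data_type="]


def parse_opts(argv):
    """Parse options and fail when a required one is missing."""
    opts, _ = getopt.getopt(argv, "", LONG_OPTS)
    # Every flag starts as False, so an omitted --size is detected below.
    size_specified = False
    num_warm_up_specified = False
    num_loops_specified = False
    parsed = {"check_data": False, "data_type": "double"}  # assumed defaults
    for name, value in opts:
        if name == "--size":
            parsed["size"] = int(value)
            size_specified = True
        elif name == "--num_warm_up":
            parsed["num_warm_up"] = int(value)
            num_warm_up_specified = True
        elif name == "--num_loops":
            parsed["num_loops"] = int(value)
            num_loops_specified = True
        elif name == "--check_data":
            parsed["check_data"] = True
        elif name == "--data_type":
            parsed["data_type"] = value
    if not (size_specified and num_warm_up_specified and num_loops_specified):
        raise ValueError("--size, --num_warm_up and --num_loops are required")
    return parsed
```

If `size_specified` started as `True` here, a call without `--size` would pass the final check and return a result with no `size` entry, which is the inconsistency the comment points at.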

Copilot

Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

size_specified is initialized to true, which makes the "required option" validation (if (!size_specified || ...)) ineffective for --size and inconsistent with PrintUsage() (which shows --size as required). Either initialize size_specified to false (to enforce --size) or remove size_specified from the required-check and update the usage text accordingly.

    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;

Copilot

Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:89

size_specified is initialized to true, so the required-argument validation (if (!size_specified || ...)) can never fail due to a missing --size. This is inconsistent with other micro-benchmark option parsers in the repo (which start this flag as false) and with PrintUsage() indicating --size is required. Initialize size_specified to false (or remove the flag entirely if --size is intended to be optional) so missing/invalid size handling is unambiguous.

                                     {"data_type", required_argument, nullptr, static_cast<int>(OptIdx::kDataType)}};
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;

Copilot

Pull request overview

Copilot reviewed 11 out of 14 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:92

ParseOpts initializes size_specified to true, which makes --size effectively optional even though the usage string lists it as required and other micro-benchmarks start this flag as false. This is likely unintended and makes the required-argument validation inconsistent. Initialize size_specified to false (or update the required-argument logic/usage text if --size is intentionally optional due to the default).

    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;

WenqingLan1 added 3 commits December 18, 2025 01:16

remove fixed gpu id & numa id assignment

7f23c75

use 128bit alignment, add float support, cleanup

d63fe8c

add data_type arg

242714e

WenqingLan1 requested a review from a team as a code owner December 19, 2025 20:05

WenqingLan1 added the micro-benchmarks Micro Benchmark Test for SuperBench Benchmarks label Dec 19, 2025

guoshzhao self-assigned this Dec 19, 2025

guoshzhao requested review from guoshzhao and polarG December 19, 2025 20:32

WenqingLan1 and others added 5 commits December 19, 2025 23:31

fix lint

e8d0282

fix clang lint

5a18946

update doc

fddf56e

Merge branch 'main' into wenqinglan/refine-gpu-stream

3c359a3

Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream

e445363

Copilot AI review requested due to automatic review settings February 3, 2026 22:14

Copilot started reviewing on behalf of WenqingLan1 February 3, 2026 22:15 View session

Copilot AI reviewed Feb 3, 2026

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated

Comment thread docs/user-tutorial/benchmarks/micro-benchmarks.md Outdated

WenqingLan1 and others added 2 commits February 5, 2026 16:04

Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream

60b130c

fix alloc count & comment

f31933f

Copilot AI review requested due to automatic review settings February 6, 2026 00:20

Copilot AI reviewed Feb 6, 2026

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu

fix: reset gpu-burn submodule to correct commit

d8a91ab

guoshzhao requested a review from abuccts February 13, 2026 00:11

guoshzhao reviewed Mar 26, 2026

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated

guoshzhao reviewed Mar 26, 2026

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu

guoshzhao reviewed Mar 26, 2026

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp Outdated

guoshzhao requested changes Mar 26, 2026

Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream

2dfa122

Copilot AI review requested due to automatic review settings April 8, 2026 20:27

Copilot started reviewing on behalf of WenqingLan1 April 8, 2026 20:30 View session

Copilot AI review requested due to automatic review settings May 21, 2026 00:26

Copilot started reviewing on behalf of WenqingLan1 May 21, 2026 00:26 View session

Copilot AI reviewed May 21, 2026

WenqingLan1 added 2 commits May 21, 2026 18:13

fix cuda11.1 build

ea3fd8e

fix doc

2b6ea7e

Copilot AI review requested due to automatic review settings May 21, 2026 18:20

Copilot started reviewing on behalf of WenqingLan1 May 21, 2026 18:21 View session

resolve comments

0fd405c

Copilot AI reviewed May 21, 2026

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream.py Outdated

WenqingLan1 added 2 commits May 21, 2026 11:44

fix syntax

4173759

fix lint

ee52086

Copilot AI review requested due to automatic review settings May 21, 2026 19:07

Copilot started reviewing on behalf of WenqingLan1 May 21, 2026 19:08 View session

Copilot AI reviewed May 21, 2026

Comment thread tests/benchmarks/micro_benchmarks/test_gpu_stream.py Outdated

Comment thread tests/benchmarks/micro_benchmarks/test_gpu_stream.py Outdated

WenqingLan1 and others added 2 commits May 21, 2026 12:58

resolve comment

80d5f0a

Merge branch 'main' into wenqinglan/refine-gpu-stream

ff9e254

Copilot AI review requested due to automatic review settings May 22, 2026 03:40

Copilot started reviewing on behalf of polarG May 22, 2026 03:40 View session

polarG approved these changes May 22, 2026

polarG enabled auto-merge (squash) May 22, 2026 03:41

Copilot AI reviewed May 22, 2026

Comment thread superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu Outdated

fix nvmldevicegetnumanodeid error

5e887ff

WenqingLan1 disabled auto-merge May 22, 2026 17:13

fix nvml call indx

18531e0

Copilot AI review requested due to automatic review settings May 22, 2026 17:27

Copilot started reviewing on behalf of WenqingLan1 May 22, 2026 17:27 View session

fix lint

b21284b

Copilot AI reviewed May 22, 2026

fix shadow ret

a7f83d4

WenqingLan1 enabled auto-merge (squash) May 22, 2026 22:44

4 participants